NLP and Topic Models

Chris Bail, PhD
Computational Sociology Duke University

Recap

 

Last class, we learned how to collect vast amounts of text-based data (and non-text data)

Recap

 

But collecting lots of text-based data quickly becomes overwhelming once you realize you cannot possibly read it all.

Today's Agenda

 

Thankfully, computer scientists and computational linguists have produced a variety of exciting new tools for automated content analysis.

Today's Agenda

 

Unfortunately, these techniques require quite a few steps. First, we need to transform words into numbers; then we can fit topic models and review some of their common pitfalls.

Today's Agenda

 

1) Creating a corpus
2) What is topic modeling?
3) Common pitfalls
4) Running R in the cloud (if we have time)

CREATING A CORPUS

First, Set Your Working Directory

 

setwd("/Users/christopherandrewbail/Desktop/Dropbox/TEACHING/Computational Soc Fall 2015/Course Dropbox")

The tm Package

 

install.packages("tm")
library(tm)

Let's Load some Data

 

Let's read in some political blogs data I put in our dropbox:

blog_data<-read.csv("poliblogs2008.csv", stringsAsFactors = FALSE)

Let's take a peek:

 

colnames(blog_data)

And another one:

 

blog_data$documents[12]
[1] "Historian Richard Brookhiser puts is succinctly at NRO's The Corner this morning:But a man who could not have used certain restrooms forty years ago is in the center ring, not as a freak in the manner of Alberto Fujimori or Sonia Gandhi, nor even as a faction fighter in the style of Jesse Jackson, but as a real player. One of our great national sins is being obliterated, as the years pass, by the virtues of our national system. I don't agree with Obama and I don't particularly like him, but I am proud of this moment. Those of us of a certain age may be surprised that an even bigger deal isn't being made out of the fact that an African American just won a huge victory in a state that is 96% white. Other pundits are marveling at the Obama phenomenon with equal surprise: David Brooks: Barack Obama has won the Iowa caucuses. You’d have to have a heart of stone not to feel moved by this. An African-American man wins a closely fought campaign in a pivotal state. He beats two strong opponents, including the mighty Clinton machine. He does it in a system that favors rural voters. He does it by getting young voters to come out to the caucuses. This is a huge moment. It’s one of those times when a movement that seemed ethereal and idealistic became a reality and took on political substance. Iowa won’t settle the race, but the rest of the primary season is going to be colored by the glow of this result. Whatever their political affiliations, Americans are going to feel good about the Obama victory, which is a story of youth, possibility and unity through diversity — the primordial themes of the American experience. Peggy Noonan: As for Sen. Obama, his victory is similarly huge. He won the five biggest counties in Iowa, from the center of the state to the South Dakota border. He carried the young in a tidal wave. He outpolled Mrs. Clinton among women. 
He did it with a classy campaign, an unruffled manner, and an appeal on the stump that said every day, through the lines: Look at who I am and see me, the change that you desire is right here, move on with me and we will bring it forward together.Andrew Sullivan: Look at their names: Huckabee and Obama. Both came from nowhere - from Arkansas and Hawaii. Both campaigned as human beings, not programmed campaign robots with messages honed in focus groups. Both faced powerful and monied establishments in both parties. And both are running two variants on the same message: change, uniting America again, saying goodbye to the bitterness of the polarized past, representing ordinary voters against the professionals. Neither has been ground down by long experience, but neither is a neophyte. You have a Republican educated in a Bible college; and a Democrat who is the most credible African-American candidate for the presidency in history. Their respective margins were far larger than most expected. And the hope they have unleashed is palpable. E.J. Dionne: Change, particularly generational change, was also at the heart of Barack Obama's victory over Hillary Rodham Clinton and John Edwards. Young voters and independents flocked to the Illinois senator. Media entrance polls showed that Obama defeated Clinton by better than 5 to 1 among voters under age 30, and such voters made up almost as large a share of the caucus electorate as voters over 65, a strongly pro-Clinton group. Among independents, Obama beat Clinton by better than 2 to 1. Matthew Yglesias: I think the manner of Barack Obama's win is pretty impressive. I can't be the only one who was a bit inclined toward a cynical roll of the eyes at the idea of winning on the back of unprecedented turnout, mobilizing new voters, brining in young people, etc. That sounds like the kind of thing that people say they're going to do but never deliver on. But he did deliver. That's impressive. 
Perhaps the best line written about last night's Obama win is a touch more negative. From Powerline: CONCLUDING THOUGHTS: Iowa has given its seal of approval to (1) a one-term Senator who stands for \"hope\" and \"change\" and (2) a tacky, big spending governor who doesn't know much about foreign policy but did stay at a Holiday Inn Express. The common demoninator here, other than a patent lack of qualifications for the presidency, is likeability. Well done, (small fraction of) Iowa."

The Joys of Character Encoding

 

blog_data$documents <- iconv(blog_data$documents, "latin1", "ASCII", sub="")

Create a Corpus

 

blog_corpus <- Corpus(VectorSource(as.vector(blog_data$documents))) 

From Words to Numbers...

Topic Models

"Pre-Processing" Text

 

blog_corpus <- tm_map(blog_corpus, content_transformer(removePunctuation)) 

"Pre-Processing" Text

 

blog_corpus <- tm_map(blog_corpus,  content_transformer(tolower)) 

"Pre-Processing" Text

 

blog_corpus <- tm_map(blog_corpus , content_transformer(stripWhitespace))

Stop Words

 

stoplist <- read.csv("english_stopwords.csv", header=TRUE, stringsAsFactors = FALSE)
stoplist<-stoplist$stopword
blog_corpus  <- tm_map(blog_corpus , content_transformer(removeWords), stoplist)

Stemming

 

library(SnowballC)  # stemDocument() relies on the SnowballC package
blog_corpus <- tm_map(blog_corpus, content_transformer(stemDocument), language = "english")

Document Term Matrix

 

Blog_DTM <- DocumentTermMatrix(blog_corpus, control = list(wordLengths = c(2, Inf)))

Inspect the Document-Term Matrix

 

inspect(Blog_DTM[1:20,1:20])

Remove Sparse Terms

 

DTM <- removeSparseTerms(Blog_DTM, 0.99)

I've now removed terms that appear in fewer than 1% of all documents.
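A quick sanity check is to compare the dimensions of the matrix before and after pruning (this sketch uses the Blog_DTM and DTM objects created above):

```r
# The vocabulary (number of columns) should shrink substantially,
# while the number of documents (rows) stays the same
dim(Blog_DTM)  # documents x full vocabulary
dim(DTM)       # documents x pruned vocabulary
```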

Inspect the Popular Terms

 

The following line finds all the words that occur more than 3,000 times in the dataset:

findFreqTerms(Blog_DTM, 3000)

Assigning the Number of Topics

 

k<-7

The Topic Models Package

 

library(topicmodels)

Setting Control Parameters

 

control_LDA_Gibbs <- list(alpha = 50/k, estimate.beta = TRUE,
                          verbose = 0, prefix = tempfile(),
                          save = 0,
                          keep = 50,
                          seed = 980,  # for reproducibility
                          nstart = 1, best = TRUE,
                          delta = 0.1,
                          iter = 2000,
                          burnin = 100,
                          thin = 2000)

Our First Topic Model

 

my_first_topic_model <- LDA(Blog_DTM, k, method = "Gibbs", control = control_LDA_Gibbs)

Getting the most Popular Terms by Topic

 

terms(my_first_topic_model, 30)

Determining K (the Number of Topics)

 

library(parallel)  # provides mclapply()
many_models <- mclapply(seq(2, 35, by = 1), function(x) {
  LDA(Blog_DTM, x, method = "Gibbs", control = control_LDA_Gibbs)
})

(Hat tip to Achim Edelman for this nice function.)

Plotting the log likelihoods

 

many_models.logLik <- as.data.frame(as.matrix(lapply(many_models, logLik)))

We can then plot the results to see where we get decreasing returns for increasing the number of topics:

plot(2:35, unlist(many_models.logLik), xlab = "Number of Topics", ylab = "Log-Likelihood")

Then We Repeat...

 

k<-10
my_first_topic_model <- LDA(Blog_DTM, k, method = "Gibbs", control = control_LDA_Gibbs)

Finding the Topic Assignments

 

This line tells us which document is assigned to which topic:

topic_assignments_by_docs <- topics(my_first_topic_model)
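If you want topic proportions rather than a single "best" topic per document, the topicmodels package also exposes the full posterior. A minimal sketch, using the my_first_topic_model object fitted above:

```r
# Per-document topic probabilities (each row sums to 1)
topic_probs <- posterior(my_first_topic_model)$topics

# Probability of each topic for the first document
round(topic_probs[1, ], 3)
```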

COMMON PITFALLS

Choosing 'k'

 

It is easy to “read the tea leaves.”

Once you have assignments, what should you do with them?

 

Probabilities in regression models?

Dichotomous dummies?

What's the cut-off?
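One common (if blunt) approach is to dichotomize: code a document as "about" a topic if its posterior probability exceeds some cutoff. The sketch below uses the fitted model from earlier; the 0.5 cutoff is an arbitrary choice for illustration, and you should justify whatever value you use:

```r
# Document-topic probabilities from the fitted model above
topic_probs <- posterior(my_first_topic_model)$topics

# Dummy: does topic 3 account for more than half of a document's content?
# (the 0.5 cutoff is arbitrary -- results can be sensitive to it)
topic3_dummy <- as.numeric(topic_probs[, 3] > 0.5)
table(topic3_dummy)
```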

Check out Structural Topic Models!

 

The stm package is very useful for using metadata to improve the accuracy of document classification. This can be particularly important if you are trying to make claims about change over time.
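A minimal stm sketch, assuming the blog data include metadata columns such as rating and day (these variable names are placeholders for illustration; check colnames(blog_data) for the actual ones):

```r
library(stm)

# stm has its own pre-processing pipeline
processed <- textProcessor(blog_data$documents, metadata = blog_data)
out <- prepDocuments(processed$documents, processed$vocab, processed$meta)

# Let topic prevalence vary with blog rating and (smoothly) over time
stm_fit <- stm(documents = out$documents, vocab = out$vocab, K = 10,
               prevalence = ~ rating + s(day), data = out$meta)
```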

RUNNING R IN THE CLOUD

Running R in the Cloud

Load R Studio in your Web Browser!

Running R in the Cloud

 

Why?

  • 100+ times the computing power
  • Cheaper than buying a faster machine
  • Ability to load multiple R sessions at once, or “parallelize” your code
  • Access your R code/data from anywhere
  • TOPIC MODELING REQUIRES SIGNIFICANT COMPUTING RESOURCES!

Running R in the Cloud

 

Why not?

Possible downsides: a) it may not be safe for sensitive data; and b) you have to pay for the time you use… but some instances are free, and even the most expensive are not too costly (e.g. ~$5/hour).

Running R in the Cloud

First, you need to create an account

Running R in the Cloud

Next, load an RStudio “Amazon Machine Image” (AMI)

Running R in the Cloud

This will redirect you to this page where you can choose how much power you want:

Running R in the Cloud

Configure your security group to allow incoming HTTP traffic on Port 80

Running R in the Cloud

Create a Key Pair (security measure)

Running R in the Cloud

Once you launch:

Running R in the Cloud

The EC2 Console

Running R in the Cloud

Cut and paste the “Public DNS” address into your browser

Running R in the Cloud

Log in: by default, the user name and password are both set to "rstudio"

Running R in the Cloud

All done!

Running R in the Cloud

 

Change your password!

Proceed with caution when loading sensitive data into the cloud!

Always “Shut down your instances,” OR YOU MAY RUN UP A BIG BILL!!!

Homework

 

Run some topic models on whatever text-based data you collected last week!

 

NEXT WEEK:

Visualization

Now that we've collected and classified data, this class will teach you how to analyze them using basic visualization techniques (e.g. scatterplots, line graphs, and bar charts). This class is therefore designed to be a "launch point" for some of R's more stunning visualization capabilities (heatmaps, network diagrams, streamgraphs, etc.) that Kieran will cover in his visualization seminar this fall.